Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]
SUMMARY: [Sample Paragraph - The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Connectionist Bench dataset presents a binary classification problem in which we try to predict one of two possible outcomes.]
INTRODUCTION: [Sample Paragraph - The data file contains patterns obtained by bouncing sonar signals off a metal cylinder or a rock at various angles and under various conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time.]
ANALYSIS: [Sample Paragraph - The baseline performance of the machine learning algorithms achieved an average accuracy of 77.62%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 80.85%. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 80.65%, which was slightly below the prediction accuracy gained from the training data.]
CONCLUSION: [Sample Paragraph - For this iteration, the Gradient Boosting algorithm achieved the best overall training and validation results. For this dataset, the Gradient Boosting algorithm could be considered for further modeling.]
Dataset Used: [Connectionist Bench (Sonar, Mines vs. Rocks) Data Set]
Dataset ML Model: Binary classification with numerical attributes
Dataset Reference: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]
One potential source of performance benchmarks: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]
The project aims to touch on the following areas:
Any predictive modeling machine learning project can generally be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(stringr)
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
email_notify <- function(msg=""){
sender <- Sys.getenv("MAIL_SENDER")
receiver <- Sys.getenv("MAIL_RECEIVER")
gateway <- Sys.getenv("SMTP_GATEWAY")
smtpuser <- Sys.getenv("SMTP_USERNAME")
password <- Sys.getenv("SMTP_PASSWORD")
sbj_line <- "Notification from R Binary Classification Script"
send.mail(
from = sender,
to = receiver,
subject= sbj_line,
body = msg,
smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE)
}
# Set the muteEmail flag to suppress progress emails (setting it to FALSE means emails will be sent!)
muteEmail <- FALSE
if (!muteEmail) email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5a39699c}"
# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]
if (!file.exists(dest_file)) {
# Download the document from the website
cat("Downloading", dataset_path, "as", dest_file, "\n")
download.file(dataset_path, dest_file, mode = "wb")
cat(dest_file, "downloaded!\n")
# unzip(dest_file)
# cat(dest_file, "unpacked!\n")
}
inputFile <- dest_file
colNames <- paste0("attr",1:60)
colNames <- c(colNames, 'targetVar')
xy_original <- read.csv(inputFile, sep=',', header=FALSE, col.names = colNames)
# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Take a peek at the dataframe after the import
head(xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7 attr8 attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## targetVar
## 1 R
## 2 R
## 3 R
## 4 R
## 5 R
## 6 R
sapply(xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
sapply(xy_original, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## 0 0 0 0 0 0 0
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## 0 0 0 0 0 0 0
## attr57 attr58 attr59 attr60 targetVar
## 0 0 0 0 0
# Not applicable for this iteration of the project
# Sample code for performing basic data cleaning tasks
# Dropping features
# xy_original$column_name <- NULL
# Mark missing values
# invalid <- 0
# xy_original$column_name[xy_original$column_name==invalid] <- NA
# Impute missing values
# column_median <- median(xy_original$column_name, na.rm = TRUE)
# xy_original$column_name[xy_original$column_name==0] <- column_median
# xy_original$column_name <- with(xy_original, impute(column_name, column_median))
# Convert columns from one data type to another
# xy_original$column_name <- as.integer(xy_original$column_name)
# xy_original$column_name <- as.factor(xy_original$column_name)
# Take a peek at the dataframe after the cleaning
head(xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7 attr8 attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## targetVar
## 1 R
## 2 R
## 3 R
## 4 R
## 5 R
## 6 R
sapply(xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
sapply(xy_original, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## 0 0 0 0 0 0 0
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## 0 0 0 0 0 0 0
## attr57 attr58 attr59 attr60 targetVar
## 0 0 0 0 0
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(xy_original)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol
# Standardize the class column to the name of targetVar if applicable
#colnames(xy_original)[targetCol] <- "targetVar"
#xy_original$targetVar <- relevel(xy_original$targetVar,"pos")
# We create training datasets (xy_train, x_train, y_train) for various visualization and cleaning/transformation operations.
# We create testing datasets (xy_test, y_test) for various visualization and cleaning/transformation operations.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
# Use 70% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(xy_original$targetVar, p=0.70, list=FALSE)
xy_train <- xy_original[training_index,]
xy_test <- xy_original[-training_index,]
if (targetCol==1) {
x_train <- xy_train[,(targetCol+1):totCol]
y_train <- xy_train[,targetCol]
y_test <- xy_test[,targetCol]
} else {
x_train <- xy_train[,1:(totAttr)]
y_train <- xy_train[,totCol]
y_test <- xy_test[,totCol]
}
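Because createDataPartition samples within each class, the split above is stratified: the class mix of the original dataset carries over into the training slice. A small self-contained sketch (the synthetic two-class vector below is for illustration only, not part of the sonar data) makes this visible:

```r
library(caret)
set.seed(888)
# Synthetic factor with a 70/30 class mix (illustration only)
y <- factor(rep(c("M", "R"), times = c(70, 30)))
idx <- createDataPartition(y, p = 0.70, list = FALSE)
# Class proportions in the sampled rows mirror the original mix
prop.table(table(y[idx]))
```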
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
dispRow <- totAttr%/%dispCol
} else {
dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 4 by 15
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
if (!muteEmail) email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1cd072a9}"
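The control object and Accuracy metric defined above are what the later train() calls consume. A minimal sketch of one such call, using logistic regression (method = "glm", chosen here purely for illustration — the actual modeling rounds evaluated several algorithms, including Random Forest and Gradient Boosting):

```r
set.seed(seedNum)
# Sketch: fit one baseline model using the repeated 10-fold CV control above
fit_glm <- train(targetVar ~ ., data = xy_train, method = "glm",
                 metric = metricTarget, trControl = control)
print(fit_glm)
```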
To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
if (!muteEmail) email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6a5fc7f7}"
head(xy_train)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7 attr8 attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## 7 0.0317 0.0956 0.1321 0.1408 0.1674 0.1710 0.0731 0.1401 0.2083 0.3513
## 8 0.0519 0.0548 0.0842 0.0319 0.1158 0.0922 0.1027 0.0613 0.1465 0.2838
## attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## 7 0.1786 0.0658 0.0513 0.3752 0.5419 0.5440 0.5150 0.4262 0.2024 0.4233
## 8 0.2802 0.3086 0.2657 0.3801 0.5626 0.4376 0.2617 0.1199 0.6676 0.9402
## attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## 7 0.7723 0.9735 0.9390 0.5559 0.5268 0.6826 0.5713 0.5429 0.2177 0.2149
## 8 0.7832 0.5352 0.6809 0.9174 0.7613 0.8220 0.8872 0.6091 0.2967 0.1103
## attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## 7 0.5811 0.6323 0.2965 0.1873 0.2969 0.5163 0.6153 0.4283 0.5479 0.6133
## 8 0.1318 0.0624 0.0990 0.4006 0.3666 0.1050 0.1915 0.3930 0.4288 0.2546
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## 7 0.5017 0.2377 0.1957 0.1749 0.1304 0.0597 0.1124 0.1047 0.0507 0.0159
## 8 0.1151 0.2196 0.1879 0.1437 0.2146 0.2360 0.1125 0.0254 0.0285 0.0178
## attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## 7 0.0195 0.0201 0.0248 0.0131 0.0070 0.0138 0.0092 0.0143 0.0036 0.0103
## 8 0.0052 0.0081 0.0120 0.0045 0.0121 0.0097 0.0085 0.0047 0.0048 0.0053
## targetVar
## 1 R
## 2 R
## 4 R
## 6 R
## 7 R
## 8 R
dim(xy_train)
## [1] 146 61
sapply(xy_train, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
summary(xy_train)
## attr1 attr2 attr3 attr4
## Min. :0.00150 Min. :0.00220 Min. :0.00240 Min. :0.00580
## 1st Qu.:0.01383 1st Qu.:0.01673 1st Qu.:0.01887 1st Qu.:0.02455
## Median :0.02300 Median :0.03165 Median :0.03470 Median :0.04405
## Mean :0.02938 Mean :0.03922 Mean :0.04527 Mean :0.05499
## 3rd Qu.:0.03658 3rd Qu.:0.04735 3rd Qu.:0.06028 3rd Qu.:0.06270
## Max. :0.13130 Max. :0.23390 Max. :0.30590 Max. :0.42640
## attr5 attr6 attr7 attr8
## Min. :0.00670 Min. :0.01020 Min. :0.0033 Min. :0.00550
## 1st Qu.:0.03572 1st Qu.:0.06797 1st Qu.:0.0860 1st Qu.:0.07855
## Median :0.06250 Median :0.09215 Median :0.1102 Median :0.11265
## Mean :0.07441 Mean :0.10360 Mean :0.1209 Mean :0.13236
## 3rd Qu.:0.09945 3rd Qu.:0.14005 3rd Qu.:0.1526 3rd Qu.:0.16900
## Max. :0.40100 Max. :0.27700 Max. :0.3016 Max. :0.45660
## attr9 attr10 attr11 attr12
## Min. :0.0117 Min. :0.0113 Min. :0.0289 Min. :0.0236
## 1st Qu.:0.0907 1st Qu.:0.1110 1st Qu.:0.1329 1st Qu.:0.1383
## Median :0.1466 Median :0.1842 Median :0.2248 Median :0.2478
## Mean :0.1731 Mean :0.2033 Mean :0.2326 Mean :0.2473
## 3rd Qu.:0.2347 3rd Qu.:0.2712 3rd Qu.:0.2984 3rd Qu.:0.3297
## Max. :0.6828 Max. :0.7106 Max. :0.7342 Max. :0.6552
## attr13 attr14 attr15 attr16
## Min. :0.0184 Min. :0.0273 Min. :0.0031 Min. :0.0162
## 1st Qu.:0.1754 1st Qu.:0.1860 1st Qu.:0.1802 1st Qu.:0.2051
## Median :0.2516 Median :0.2878 Median :0.3010 Median :0.3393
## Mean :0.2748 Mean :0.3037 Mean :0.3346 Mean :0.3945
## 3rd Qu.:0.3615 3rd Qu.:0.3948 3rd Qu.:0.4937 3rd Qu.:0.5433
## Max. :0.7022 Max. :0.9970 Max. :1.0000 Max. :0.9988
## attr17 attr18 attr19 attr20
## Min. :0.0349 Min. :0.0689 Min. :0.0494 Min. :0.0740
## 1st Qu.:0.2164 1st Qu.:0.2449 1st Qu.:0.3124 1st Qu.:0.3508
## Median :0.3508 Median :0.3803 Median :0.4480 Median :0.5443
## Mean :0.4357 Mean :0.4698 Mean :0.5168 Mean :0.5672
## 3rd Qu.:0.6827 3rd Qu.:0.6913 3rd Qu.:0.7426 3rd Qu.:0.8179
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## attr21 attr22 attr23 attr24
## Min. :0.0512 Min. :0.0689 Min. :0.0563 Min. :0.0239
## 1st Qu.:0.4035 1st Qu.:0.4295 1st Qu.:0.4600 1st Qu.:0.5307
## Median :0.6479 Median :0.6836 Median :0.7015 Median :0.6944
## Mean :0.6127 Mean :0.6379 Mean :0.6570 Mean :0.6748
## 3rd Qu.:0.8358 3rd Qu.:0.8537 3rd Qu.:0.8635 3rd Qu.:0.8748
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## attr25 attr26 attr27 attr28
## Min. :0.0395 Min. :0.0921 Min. :0.0481 Min. :0.0284
## 1st Qu.:0.5340 1st Qu.:0.5486 1st Qu.:0.5362 1st Qu.:0.5005
## Median :0.7061 Median :0.7501 Median :0.7462 Median :0.7124
## Mean :0.6723 Mean :0.7006 Mean :0.7015 Mean :0.6731
## 3rd Qu.:0.8641 3rd Qu.:0.8904 3rd Qu.:0.9143 3rd Qu.:0.8703
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## attr29 attr30 attr31 attr32
## Min. :0.0144 Min. :0.0613 Min. :0.0482 Min. :0.0404
## 1st Qu.:0.4426 1st Qu.:0.3825 1st Qu.:0.3043 1st Qu.:0.2766
## Median :0.6296 Median :0.5657 Median :0.4487 Median :0.3986
## Mean :0.6186 Mean :0.5614 Mean :0.4856 Mean :0.4267
## 3rd Qu.:0.8443 3rd Qu.:0.7310 3rd Qu.:0.6284 3rd Qu.:0.5607
## Max. :1.0000 Max. :1.0000 Max. :0.9657 Max. :0.9306
## attr33 attr34 attr35 attr36
## Min. :0.0477 Min. :0.0212 Min. :0.0223 Min. :0.0271
## 1st Qu.:0.2554 1st Qu.:0.2097 1st Qu.:0.1737 1st Qu.:0.1550
## Median :0.3897 Median :0.3372 Median :0.3042 Median :0.3184
## Mean :0.4076 Mean :0.3909 Mean :0.3874 Mean :0.3901
## 3rd Qu.:0.5381 3rd Qu.:0.5903 3rd Qu.:0.6110 3rd Qu.:0.5825
## Max. :0.9708 Max. :0.9647 Max. :1.0000 Max. :1.0000
## attr37 attr38 attr39 attr40
## Min. :0.0351 Min. :0.0383 Min. :0.0371 Min. :0.0117
## 1st Qu.:0.1673 1st Qu.:0.1722 1st Qu.:0.1671 1st Qu.:0.1782
## Median :0.3283 Median :0.2940 Median :0.2782 Median :0.2791
## Mean :0.3713 Mean :0.3323 Mean :0.3151 Mean :0.3094
## 3rd Qu.:0.5402 3rd Qu.:0.4423 3rd Qu.:0.4277 3rd Qu.:0.4299
## Max. :0.9497 Max. :1.0000 Max. :0.9857 Max. :0.9167
## attr41 attr42 attr43 attr44
## Min. :0.0360 Min. :0.0056 Min. :0.0159 Min. :0.0255
## 1st Qu.:0.1604 1st Qu.:0.1555 1st Qu.:0.1578 1st Qu.:0.1279
## Median :0.2661 Median :0.2415 Median :0.2364 Median :0.1764
## Mean :0.2854 Mean :0.2750 Mean :0.2471 Mean :0.2141
## 3rd Qu.:0.3939 3rd Qu.:0.3856 3rd Qu.:0.3197 3rd Qu.:0.2685
## Max. :0.7322 Max. :0.8246 Max. :0.7517 Max. :0.5772
## attr45 attr46 attr47 attr48
## Min. :0.0095 Min. :0.00250 Min. :0.00730 Min. :0.00410
## 1st Qu.:0.1074 1st Qu.:0.06897 1st Qu.:0.06283 1st Qu.:0.04537
## Median :0.1489 Median :0.12525 Median :0.10550 Median :0.07860
## Mean :0.2003 Mean :0.16084 Mean :0.12130 Mean :0.09138
## 3rd Qu.:0.2369 3rd Qu.:0.20413 3rd Qu.:0.15345 3rd Qu.:0.11925
## Max. :0.7034 Max. :0.72920 Max. :0.55220 Max. :0.33390
## attr49 attr50 attr51 attr52
## Min. :0.00210 Min. :0.00060 Min. :0.000900 Min. :0.001300
## 1st Qu.:0.02660 1st Qu.:0.01202 1st Qu.:0.009125 1st Qu.:0.007725
## Median :0.04515 Median :0.01840 Median :0.014750 Median :0.011250
## Mean :0.05242 Mean :0.02073 Mean :0.016533 Mean :0.013642
## 3rd Qu.:0.07130 3rd Qu.:0.02602 3rd Qu.:0.021475 3rd Qu.:0.016375
## Max. :0.16080 Max. :0.06370 Max. :0.100400 Max. :0.070900
## attr53 attr54 attr55
## Min. :0.00050 Min. :0.001800 Min. :0.001200
## 1st Qu.:0.00450 1st Qu.:0.006025 1st Qu.:0.003925
## Median :0.00935 Median :0.009300 Median :0.007500
## Mean :0.01035 Mean :0.011103 Mean :0.009260
## 3rd Qu.:0.01468 3rd Qu.:0.014500 3rd Qu.:0.012325
## Max. :0.03170 Max. :0.035200 Max. :0.037200
## attr56 attr57 attr58
## Min. :0.000600 Min. :0.00030 Min. :0.000600
## 1st Qu.:0.004800 1st Qu.:0.00405 1st Qu.:0.003900
## Median :0.007450 Median :0.00655 Median :0.006350
## Mean :0.008347 Mean :0.00798 Mean :0.008359
## 3rd Qu.:0.011075 3rd Qu.:0.01050 3rd Qu.:0.011200
## Max. :0.032600 Max. :0.02580 Max. :0.037700
## attr59 attr60 targetVar
## Min. :0.000200 Min. :0.000600 M:78
## 1st Qu.:0.004300 1st Qu.:0.003225 R:68
## Median :0.006400 Median :0.005350
## Mean :0.007888 Mean :0.006714
## 3rd Qu.:0.010475 3rd Qu.:0.008775
## Max. :0.033200 Max. :0.043900
sapply(xy_train, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## 0 0 0 0 0 0 0
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## 0 0 0 0 0 0 0
## attr57 attr58 attr59 attr60 targetVar
## 0 0 0 0 0
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
## freq percentage
## M 78 53.42466
## R 68 46.57534
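The class distribution is close to balanced (roughly 53% mines vs. 47% rocks), so no rebalancing is applied in this iteration. Had the imbalance been severe, the DMwR package loaded earlier provides the SMOTE technique; the commented sketch below is illustrative only (the perc.over and perc.under values are placeholders, not tuned):

```r
# Not run in this iteration -- SMOTE oversamples the minority class and
# undersamples the majority class to rebalance the training data
# xy_balanced <- SMOTE(targetVar ~ ., data = xy_train, perc.over = 100, perc.under = 200)
# cbind(freq=table(xy_balanced$targetVar), percentage=prop.table(table(xy_balanced$targetVar))*100)
```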
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
boxplot(x_train[,i], main=names(x_train)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
hist(x_train[,i], main=names(x_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
plot(density(x_train[,i]), main=names(x_train)[i])
}
# Scatterplot matrix colored by class
#pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
#scales <- list(x=list(relation="free"), y=list(relation="free"))
#featurePlot(x=x_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
#featurePlot(x=x_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
if (!muteEmail) email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@511baa65}"
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Such data-prep tasks might include scaling or standardizing the attributes, removing highly correlated attributes, and selecting only the most relevant features:
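As an example of one common transform step (sketched here only — no transforms were applied in this iteration of the project), caret's preProcess can center and scale the numeric attributes:

```r
# Sketch: center and scale the attribute columns (not applied in this iteration)
pp <- preProcess(x_train, method = c("center", "scale"))
x_train_scaled <- predict(pp, x_train)
summary(x_train_scaled$attr1)  # mean near 0 after centering and scaling
```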
if (!muteEmail) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6fadae5d}"
# Not applicable for this iteration of the project
# Using the correlations calculated previously, we try to find attributes that are highly correlated.
# highlyCorrelated <- findCorrelation(correlations, cutoff=0.85)
# print(highlyCorrelated)
# cat('Number of attributes found to be highly correlated:',length(highlyCorrelated))
# Removing the highly correlated attributes from the training and validation dataframes
# xy_train <- xy_train[, -highlyCorrelated]
# xy_test <- xy_test[, -highlyCorrelated]
# Not applicable for this iteration of the project
# Sample code for performing feature selection by ranking the attributes' importance.
# startTimeModule <- proc.time()
# set.seed(seedNum)
# library(gbm)
# model_fs <- train(targetVar~., data=xy_train, method="gbm", preProcess="scale", trControl=control, verbose=F)
# rankedImportance <- varImp(model_fs, scale=FALSE)
# print(rankedImportance)
# plot(rankedImportance)
# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
# maxThreshold <- 0.99
# rankedAttributes <- rankedImportance$importance
# rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
# totalWeight <- sum(rankedAttributes)
# i <- 1
# accumWeight <- 0
# exit_now <- FALSE
# while ((i <= totAttr) & !exit_now) {
# accumWeight = accumWeight + rankedAttributes[i,]
# if ((accumWeight/totalWeight) >= maxThreshold) {
# exit_now <- TRUE
# } else {
# i <- i + 1
# }
# }
# lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
# lowAttributes <- rownames(lowImportance)
# cat('Number of attributes contributed to the importance threshold:',i,"\n")
# cat('Number of attributes found to be of low importance:',length(lowAttributes))
# Removing the unselected attributes from the training and validation dataframes
# xy_train <- xy_train[, !(names(xy_train) %in% lowAttributes)]
# xy_test <- xy_test[, !(names(xy_test) %in% lowAttributes)]
# Not applicable for this iteration of the project
# Sample code for performing feature selection using the Recursive Feature Elimination (RFE) technique
# startTimeModule <- proc.time()
# set.seed(seedNum)
# rfeCTRL <- rfeControl(functions=rfFuncs, method="cv", number=10)
# rfeResults <- rfe(xy_train[,1:totAttr], xy_train[,totCol], sizes=c(30:55), rfeControl=rfeCTRL)
# print(rfeResults)
# rfeAttributes <- predictors(rfeResults)
# cat('Number of attributes identified from the RFE algorithm:',length(rfeAttributes))
# print(rfeAttributes)
# plot(rfeResults, type=c("g", "o"))
# Removing the unselected attributes from the training and validation dataframes
# rfeAttributes <- c(rfeAttributes,"targetVar")
# xy_train <- xy_train[, (names(xy_train) %in% rfeAttributes)]
# xy_test <- xy_test[, (names(xy_test) %in% rfeAttributes)]
# Not applicable for this iteration of the project
# Sample code for performing the SMOTE transformation to address class imbalance
# set.seed(seedNum)
# xy_train <- SMOTE(targetVar ~., data=xy_train, perc.over=200, perc.under=300)
# totCol <- ncol(xy_train)
# y_train <- xy_train[,totCol]
# cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
dim(xy_train)
## [1] 146 61
dim(xy_test)
## [1] 62 61
sapply(xy_train, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
if (!muteEmail) email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@35f983a6}"
proc.time()-startTimeScript
## user system elapsed
## 26.540 0.360 34.798
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include:
For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:
Linear Algorithm: Logistic Regression
Non-Linear Algorithm: Decision Trees (CART)
Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting
The random number seed is reset before each run so that the evaluation of every algorithm is performed using the same data splits. This ensures that the results are directly comparable.
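The seed-reset pattern can be sketched as follows (a sketch only; control, metricTarget, and seedNum are defined earlier in the script, and each algorithm block in this report repeats the same two lines before its own train() call):

```r
# Sketch: identical CV folds across algorithms via a fixed seed
# set.seed(seedNum)
# fit.example <- train(targetVar~., data=xy_train, method="glm",
#                      metric=metricTarget, trControl=control)
```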
# Logistic Regression (Classification)
if (!muteEmail) email_notify(paste("Logistic Regression modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3498ed}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(fit.glm)
## Generalized Linear Model
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7390476 0.4767065
proc.time()-startTimeModule
## user system elapsed
## 0.820 0.000 0.812
if (!muteEmail) email_notify(paste("Logistic Regression modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@38082d64}"
# Decision Tree - CART (Regression/Classification)
if (!muteEmail) email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@180bc464}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.07352941 0.6847619 0.3640170
## 0.08823529 0.6776190 0.3492613
## 0.50000000 0.5833333 0.1200159
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.07352941.
proc.time()-startTimeModule
## user system elapsed
## 0.900 0.000 0.899
if (!muteEmail) email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6fb554cc}"
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
# Bagged CART (Regression/Classification)
if (!muteEmail) email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1936f0f5}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7680952 0.5299798
proc.time()-startTimeModule
## user system elapsed
## 3.740 0.010 3.742
if (!muteEmail) email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@35fb3008}"
# Random Forest (Regression/Classification)
if (!muteEmail) email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@737996a0}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8357143 0.6634035
## 31 0.7747619 0.5451204
## 60 0.7880952 0.5716670
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 6.300 0.010 6.314
if (!muteEmail) email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@13a57a3b}"
# Gradient Boosting (Regression/Classification)
if (!muteEmail) email_notify(paste("Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2669b199}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
# fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy
## 0.3 1 0.6 0.50 50 0.8152381
## 0.3 1 0.6 0.50 100 0.8428571
## 0.3 1 0.6 0.50 150 0.8233333
## 0.3 1 0.6 0.75 50 0.8090476
## 0.3 1 0.6 0.75 100 0.7957143
## 0.3 1 0.6 0.75 150 0.8152381
## 0.3 1 0.6 1.00 50 0.8076190
## 0.3 1 0.6 1.00 100 0.8485714
## 0.3 1 0.6 1.00 150 0.8423810
## 0.3 1 0.8 0.50 50 0.8419048
## 0.3 1 0.8 0.50 100 0.8490476
## 0.3 1 0.8 0.50 150 0.8561905
## 0.3 1 0.8 0.75 50 0.8090476
## 0.3 1 0.8 0.75 100 0.8085714
## 0.3 1 0.8 0.75 150 0.8361905
## 0.3 1 0.8 1.00 50 0.8152381
## 0.3 1 0.8 1.00 100 0.8561905
## 0.3 1 0.8 1.00 150 0.8428571
## 0.3 2 0.6 0.50 50 0.8014286
## 0.3 2 0.6 0.50 100 0.8152381
## 0.3 2 0.6 0.50 150 0.8366667
## 0.3 2 0.6 0.75 50 0.8085714
## 0.3 2 0.6 0.75 100 0.8223810
## 0.3 2 0.6 0.75 150 0.8366667
## 0.3 2 0.6 1.00 50 0.8352381
## 0.3 2 0.6 1.00 100 0.8633333
## 0.3 2 0.6 1.00 150 0.8495238
## 0.3 2 0.8 0.50 50 0.8766667
## 0.3 2 0.8 0.50 100 0.8352381
## 0.3 2 0.8 0.50 150 0.8285714
## 0.3 2 0.8 0.75 50 0.8357143
## 0.3 2 0.8 0.75 100 0.8223810
## 0.3 2 0.8 0.75 150 0.8223810
## 0.3 2 0.8 1.00 50 0.8490476
## 0.3 2 0.8 1.00 100 0.8495238
## 0.3 2 0.8 1.00 150 0.8428571
## 0.3 3 0.6 0.50 50 0.8504762
## 0.3 3 0.6 0.50 100 0.8295238
## 0.3 3 0.6 0.50 150 0.8366667
## 0.3 3 0.6 0.75 50 0.8695238
## 0.3 3 0.6 0.75 100 0.8704762
## 0.3 3 0.6 0.75 150 0.8704762
## 0.3 3 0.6 1.00 50 0.8157143
## 0.3 3 0.6 1.00 100 0.8290476
## 0.3 3 0.6 1.00 150 0.8290476
## 0.3 3 0.8 0.50 50 0.8361905
## 0.3 3 0.8 0.50 100 0.8428571
## 0.3 3 0.8 0.50 150 0.8495238
## 0.3 3 0.8 0.75 50 0.8366667
## 0.3 3 0.8 0.75 100 0.8571429
## 0.3 3 0.8 0.75 150 0.8504762
## 0.3 3 0.8 1.00 50 0.8423810
## 0.3 3 0.8 1.00 100 0.8490476
## 0.3 3 0.8 1.00 150 0.8423810
## 0.4 1 0.6 0.50 50 0.8014286
## 0.4 1 0.6 0.50 100 0.8023810
## 0.4 1 0.6 0.50 150 0.8023810
## 0.4 1 0.6 0.75 50 0.8219048
## 0.4 1 0.6 0.75 100 0.8561905
## 0.4 1 0.6 0.75 150 0.8704762
## 0.4 1 0.6 1.00 50 0.8290476
## 0.4 1 0.6 1.00 100 0.8490476
## 0.4 1 0.6 1.00 150 0.8357143
## 0.4 1 0.8 0.50 50 0.8280952
## 0.4 1 0.8 0.50 100 0.8290476
## 0.4 1 0.8 0.50 150 0.8423810
## 0.4 1 0.8 0.75 50 0.8285714
## 0.4 1 0.8 0.75 100 0.8285714
## 0.4 1 0.8 0.75 150 0.8147619
## 0.4 1 0.8 1.00 50 0.8357143
## 0.4 1 0.8 1.00 100 0.8357143
## 0.4 1 0.8 1.00 150 0.8357143
## 0.4 2 0.6 0.50 50 0.8495238
## 0.4 2 0.6 0.50 100 0.8628571
## 0.4 2 0.6 0.50 150 0.8428571
## 0.4 2 0.6 0.75 50 0.8357143
## 0.4 2 0.6 0.75 100 0.8428571
## 0.4 2 0.6 0.75 150 0.8423810
## 0.4 2 0.6 1.00 50 0.8347619
## 0.4 2 0.6 1.00 100 0.8285714
## 0.4 2 0.6 1.00 150 0.8285714
## 0.4 2 0.8 0.50 50 0.8347619
## 0.4 2 0.8 0.50 100 0.8485714
## 0.4 2 0.8 0.50 150 0.8352381
## 0.4 2 0.8 0.75 50 0.8552381
## 0.4 2 0.8 0.75 100 0.8490476
## 0.4 2 0.8 0.75 150 0.8485714
## 0.4 2 0.8 1.00 50 0.8357143
## 0.4 2 0.8 1.00 100 0.8566667
## 0.4 2 0.8 1.00 150 0.8566667
## 0.4 3 0.6 0.50 50 0.8352381
## 0.4 3 0.6 0.50 100 0.8352381
## 0.4 3 0.6 0.50 150 0.8347619
## 0.4 3 0.6 0.75 50 0.8419048
## 0.4 3 0.6 0.75 100 0.8490476
## 0.4 3 0.6 0.75 150 0.8490476
## 0.4 3 0.6 1.00 50 0.8209524
## 0.4 3 0.6 1.00 100 0.8490476
## 0.4 3 0.6 1.00 150 0.8490476
## 0.4 3 0.8 0.50 50 0.8161905
## 0.4 3 0.8 0.50 100 0.8228571
## 0.4 3 0.8 0.50 150 0.8228571
## 0.4 3 0.8 0.75 50 0.8490476
## 0.4 3 0.8 0.75 100 0.8495238
## 0.4 3 0.8 0.75 150 0.8423810
## 0.4 3 0.8 1.00 50 0.8495238
## 0.4 3 0.8 1.00 100 0.8704762
## 0.4 3 0.8 1.00 150 0.8704762
## Kappa
## 0.6269515
## 0.6800768
## 0.6412784
## 0.6137278
## 0.5867604
## 0.6257157
## 0.6131517
## 0.6959970
## 0.6817465
## 0.6816369
## 0.6944030
## 0.7086888
## 0.6131820
## 0.6129151
## 0.6690120
## 0.6291544
## 0.7104835
## 0.6827579
## 0.6000520
## 0.6266617
## 0.6695189
## 0.6121573
## 0.6395144
## 0.6674526
## 0.6676661
## 0.7235815
## 0.6960347
## 0.7485045
## 0.6651003
## 0.6523107
## 0.6655254
## 0.6392402
## 0.6392402
## 0.6947622
## 0.6955366
## 0.6817817
## 0.6952238
## 0.6531452
## 0.6674309
## 0.7314825
## 0.7354981
## 0.7348902
## 0.6262422
## 0.6532780
## 0.6532780
## 0.6687733
## 0.6807599
## 0.6937972
## 0.6672048
## 0.7092217
## 0.6954625
## 0.6809859
## 0.6952478
## 0.6815077
## 0.6006864
## 0.6004031
## 0.6013369
## 0.6389682
## 0.7080999
## 0.7371583
## 0.6545138
## 0.6949743
## 0.6674754
## 0.6535142
## 0.6549530
## 0.6810870
## 0.6517208
## 0.6537727
## 0.6264717
## 0.6686793
## 0.6677145
## 0.6689379
## 0.6975957
## 0.7248493
## 0.6840757
## 0.6669667
## 0.6810134
## 0.6797586
## 0.6671311
## 0.6538916
## 0.6538916
## 0.6657127
## 0.6947822
## 0.6684791
## 0.7067686
## 0.6936395
## 0.6928610
## 0.6684227
## 0.7089366
## 0.7089366
## 0.6708530
## 0.6697102
## 0.6684659
## 0.6802525
## 0.6941618
## 0.6941618
## 0.6381358
## 0.6937986
## 0.6937986
## 0.6258671
## 0.6387655
## 0.6390223
## 0.6921711
## 0.6924451
## 0.6781594
## 0.6940700
## 0.7362397
## 0.7362397
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 2,
## eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
## and subsample = 0.5.
proc.time()-startTimeModule
## user system elapsed
## 36.270 0.760 19.699
if (!muteEmail) email_notify(paste("Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23ceabc1}"
results <- resamples(list(LR=fit.glm, CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LR, CART, BagCART, RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.5333333 0.6607143 0.7666667 0.7390476 0.8000000 0.9285714 0
## CART 0.4666667 0.5892857 0.7238095 0.6847619 0.7726190 0.8666667 0
## BagCART 0.4000000 0.6785714 0.8000000 0.7680952 0.9107143 0.9333333 0
## RF 0.6000000 0.7892857 0.8333333 0.8357143 0.9321429 1.0000000 0
## GBM 0.6666667 0.7857143 0.9309524 0.8766667 0.9833333 1.0000000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.08695652 0.3331202 0.5257208 0.4767065 0.5999761 0.8571429 0
## CART -0.05263158 0.1785714 0.4356061 0.3640170 0.5446429 0.7368421 0
## BagCART -0.19469027 0.3525836 0.5944629 0.5299798 0.8199405 0.8672566 0
## RF 0.16666667 0.5616826 0.6596494 0.6634035 0.8629344 1.0000000 0
## GBM 0.32432432 0.5577508 0.8621997 0.7485045 0.9668142 1.0000000 0
dotplot(results)
cat('The average accuracy from all models is:',
mean(c(results$values$`LR~Accuracy`,results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.7808571
After we have a short list of machine learning algorithms with a good level of accuracy, we can leverage ways to improve their accuracy.
Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.
Finally, we will tune the best-performing algorithm from each group further and see whether we can get more accuracy out of it.
# Tuning algorithm #1 - Random Forest
if (!muteEmail) email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@396a51ab}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry=c(2,15,30,45,60))
fit.final1 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
print(fit.final1)
## Random Forest
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8357143 0.6634035
## 31 0.7747619 0.5451204
## 60 0.7880952 0.5716670
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 6.170 0.010 6.181
if (!muteEmail) email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@548a9f61}"
# Tuning algorithm #2 - Gradient Boosting
if (!muteEmail) email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1b604f19}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100,150,200,250,300), max_depth=3, eta=0.3, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
fit.final2 <- train(targetVar~., data=xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)
print(fit.final2)
## eXtreme Gradient Boosting
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 131, 131, 131, 131, 132, 132, ...
## Resampling results across tuning parameters:
##
## nrounds Accuracy Kappa
## 100 0.8561905 0.7095273
## 150 0.8495238 0.6945820
## 200 0.8495238 0.6945820
## 250 0.8495238 0.6945820
## 300 0.8495238 0.6945820
##
## Tuning parameter 'max_depth' was held constant at a value of 3
## 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of
## 1
## Tuning parameter 'subsample' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3,
## eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
## and subsample = 1.
proc.time()-startTimeModule
## user system elapsed
## 2.460 0.050 1.614
if (!muteEmail) email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@55f3ddb1}"
results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.6000000 0.7892857 0.8333333 0.8357143 0.9321429 1 0
## GBM 0.6428571 0.7500000 0.8619048 0.8561905 0.9821429 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.1666667 0.5616826 0.6596494 0.6634035 0.8629344 1 0
## GBM 0.2857143 0.4950033 0.7232143 0.7095273 0.9642857 1 0
dotplot(results)
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:
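One such sub-task, persisting the finalized model for later scoring, can be sketched as follows (the file name is the one used later in this report; x_new is a hypothetical dataframe of new observations):

```r
# Sketch: save the finalized model and reload it for scoring new data
# saveRDS(finalModel, "./finalModel_BinaryClass.rds")
# loadedModel <- readRDS("./finalModel_BinaryClass.rds")
# newPredictions <- predict(loadedModel, as.matrix(x_new))
```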
if (!muteEmail) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@46d56d67}"
predictions <- predict(fit.final2, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 25 4
## R 8 25
##
## Accuracy : 0.8065
## 95% CI : (0.6863, 0.8958)
## No Information Rate : 0.5323
## P-Value [Acc > NIR] : 6.468e-06
##
## Kappa : 0.6145
##
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.7576
## Specificity : 0.8621
## Pos Pred Value : 0.8621
## Neg Pred Value : 0.7576
## Prevalence : 0.5323
## Detection Rate : 0.4032
## Detection Prevalence : 0.4677
## Balanced Accuracy : 0.8098
##
## 'Positive' Class : M
##
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8098224
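Note that the ROC curve above is built from the hard class labels, which yields a single-threshold curve. A probability-based alternative can be sketched as follows (a sketch only, assuming the ROCR package is already loaded for prediction() and performance(), and that 'M' is the positive class):

```r
# Sketch: ROC/AUC from predicted class probabilities instead of hard labels
# probs <- predict(fit.final2, newdata=xy_test, type="prob")
# pred_prob <- prediction(probs$M, y_test)
# perf_prob <- performance(pred_prob, measure="tpr", x.measure="fpr")
# plot(perf_prob, colorize=TRUE)
# performance(pred_prob, measure="auc")@y.values[[1]]
```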
startTimeModule <- proc.time()
library(xgboost)
set.seed(seedNum)
# Combining the training and test datasets to form the original dataset that will be used for training the final model
xy_complete <- rbind(xy_train, xy_test)
y_final <- xy_complete$targetVar
xy_complete$targetVar <- NULL
x_final <- as.matrix(xy_complete)
finalModel <- xgboost(data=x_final, label=y_final, nrounds=200, max_depth=3, eta=0.3, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
## [1] train-rmse:0.803984
## [2] train-rmse:0.600953
## [3] train-rmse:0.464395
## [4] train-rmse:0.366991
## [5] train-rmse:0.299308
## [6] train-rmse:0.256248
## [7] train-rmse:0.217517
## [8] train-rmse:0.196324
## [9] train-rmse:0.177970
## [10] train-rmse:0.164598
## [11] train-rmse:0.155103
## [12] train-rmse:0.141579
## [13] train-rmse:0.133015
## [14] train-rmse:0.126360
## [15] train-rmse:0.121515
## [16] train-rmse:0.113750
## [17] train-rmse:0.108468
## [18] train-rmse:0.102441
## [19] train-rmse:0.095145
## [20] train-rmse:0.091136
## [21] train-rmse:0.086799
## [22] train-rmse:0.083983
## [23] train-rmse:0.081710
## [24] train-rmse:0.077199
## [25] train-rmse:0.075749
## [26] train-rmse:0.073273
## [27] train-rmse:0.070372
## [28] train-rmse:0.066862
## [29] train-rmse:0.065097
## [30] train-rmse:0.060902
## [31] train-rmse:0.059349
## [32] train-rmse:0.057318
## [33] train-rmse:0.055403
## [34] train-rmse:0.051558
## [35] train-rmse:0.049563
## [36] train-rmse:0.047672
## [37] train-rmse:0.046153
## [38] train-rmse:0.044706
## [39] train-rmse:0.043398
## [40] train-rmse:0.041250
## [41] train-rmse:0.040274
## [42] train-rmse:0.038911
## [43] train-rmse:0.037621
## [44] train-rmse:0.035641
## [45] train-rmse:0.033812
## [46] train-rmse:0.032290
## [47] train-rmse:0.031603
## [48] train-rmse:0.030984
## [49] train-rmse:0.030197
## [50] train-rmse:0.029279
## [51] train-rmse:0.028632
## [52] train-rmse:0.027892
## [53] train-rmse:0.027304
## [54] train-rmse:0.026544
## [55] train-rmse:0.025815
## [56] train-rmse:0.024534
## [57] train-rmse:0.023862
## [58] train-rmse:0.023203
## [59] train-rmse:0.022846
## [60] train-rmse:0.022465
## [61] train-rmse:0.021273
## [62] train-rmse:0.020742
## [63] train-rmse:0.020035
## [64] train-rmse:0.019481
## [65] train-rmse:0.019089
## [66] train-rmse:0.018774
## [67] train-rmse:0.018189
## [68] train-rmse:0.017689
## [69] train-rmse:0.016974
## [70] train-rmse:0.016630
## [71] train-rmse:0.016139
## [72] train-rmse:0.015161
## [73] train-rmse:0.014741
## [74] train-rmse:0.014248
## [75] train-rmse:0.013938
## [76] train-rmse:0.013585
## [77] train-rmse:0.012958
## [78] train-rmse:0.012424
## [79] train-rmse:0.012033
## [80] train-rmse:0.011696
## [81] train-rmse:0.011043
## [82] train-rmse:0.010800
## [83] train-rmse:0.010449
## [84] train-rmse:0.010220
## [85] train-rmse:0.009679
## [86] train-rmse:0.009455
## [87] train-rmse:0.009294
## [88] train-rmse:0.009022
## [89] train-rmse:0.008786
## [90] train-rmse:0.008420
## [91] train-rmse:0.008239
## [92] train-rmse:0.007926
## [93] train-rmse:0.007720
## [94] train-rmse:0.007574
## [95] train-rmse:0.007377
## [96] train-rmse:0.007070
## [97] train-rmse:0.006941
## [98] train-rmse:0.006618
## [99] train-rmse:0.006212
## [100] train-rmse:0.005889
## [101] train-rmse:0.005711
## [102] train-rmse:0.005442
## [103] train-rmse:0.005232
## [104] train-rmse:0.005112
## [105] train-rmse:0.004865
## [106] train-rmse:0.004741
## [107] train-rmse:0.004470
## [108] train-rmse:0.004343
## [109] train-rmse:0.004233
## [110] train-rmse:0.004028
## [111] train-rmse:0.003832
## [112] train-rmse:0.003740
## [113] train-rmse:0.003653
## [114] train-rmse:0.003535
## [115] train-rmse:0.003416
## [116] train-rmse:0.003296
## [117] train-rmse:0.003093
## [118] train-rmse:0.003049
## [119] train-rmse:0.002995
## [120] train-rmse:0.002895
## [121] train-rmse:0.002823
## [122] train-rmse:0.002752
## [123] train-rmse:0.002693
## [124] train-rmse:0.002625
## [125] train-rmse:0.002568
## [126] train-rmse:0.002507
## [127] train-rmse:0.002388
## [128] train-rmse:0.002349
## [129] train-rmse:0.002211
## [130] train-rmse:0.002107
## [131] train-rmse:0.002033
## [132] train-rmse:0.001902
## [133] train-rmse:0.001829
## [134] train-rmse:0.001781
## [135] train-rmse:0.001675
## [136] train-rmse:0.001626
## [137] train-rmse:0.001575
## [138] train-rmse:0.001550
## [139] train-rmse:0.001522
## [140] train-rmse:0.001479
## [141] train-rmse:0.001421
## [142] train-rmse:0.001397
## [143] train-rmse:0.001380
## [144] train-rmse:0.001353
## [145] train-rmse:0.001322
## [146] train-rmse:0.001290
## [147] train-rmse:0.001263
## [148] train-rmse:0.001245
## [149] train-rmse:0.001226
## [150] train-rmse:0.001204
## [151] train-rmse:0.001172
## [152] train-rmse:0.001143
## [153] train-rmse:0.001083
## [154] train-rmse:0.001066
## [155] train-rmse:0.001015
## [156] train-rmse:0.000957
## [157] train-rmse:0.000897
## [158] train-rmse:0.000875
## [159] train-rmse:0.000861
## [160] train-rmse:0.000844
## [161] train-rmse:0.000819
## [162] train-rmse:0.000788
## [163] train-rmse:0.000772
## [164] train-rmse:0.000753
## [165] train-rmse:0.000745
## [166] train-rmse:0.000728
## [167] train-rmse:0.000710
## [168] train-rmse:0.000703
## [169] train-rmse:0.000673
## [170] train-rmse:0.000648
## [171] train-rmse:0.000631
## [172] train-rmse:0.000631
## [173] train-rmse:0.000612
## [174] train-rmse:0.000612
## [175] train-rmse:0.000612
## [176] train-rmse:0.000612
## [177] train-rmse:0.000612
## [178] train-rmse:0.000612
## [179] train-rmse:0.000612
## [180] train-rmse:0.000612
## [181] train-rmse:0.000612
## [182] train-rmse:0.000612
## [183] train-rmse:0.000612
## [184] train-rmse:0.000612
## [185] train-rmse:0.000612
## [186] train-rmse:0.000612
## [187] train-rmse:0.000612
## [188] train-rmse:0.000612
## [189] train-rmse:0.000612
## [190] train-rmse:0.000612
## [191] train-rmse:0.000612
## [192] train-rmse:0.000612
## [193] train-rmse:0.000612
## [194] train-rmse:0.000612
## [195] train-rmse:0.000612
## [196] train-rmse:0.000612
## [197] train-rmse:0.000612
## [198] train-rmse:0.000612
## [199] train-rmse:0.000612
## [200] train-rmse:0.000612
print(finalModel)
## ##### xgb.Booster
## raw: 107.4 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, max_depth = 3, eta = 0.3, gamma = 0,
## colsample_bytree = 0.6, min_child_weight = 1, subsample = 1)
## params (as set within xgb.train):
## max_depth = "3", eta = "0.3", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "1", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## cb.evaluation.log()
## # of features: 60
## niter: 200
## nfeatures : 60
## evaluation_log:
## iter train_rmse
## 1 0.803984
## 2 0.600953
## ---
## 199 0.000612
## 200 0.000612
proc.time()-startTimeModule
## user system elapsed
## 0.38 0.00 0.21
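The train-rmse log above indicates that xgboost() fell back to its default regression objective, because the factor label was coerced to numeric values. An explicitly binary formulation can be sketched as follows (a sketch only, assuming 'M' should map to 1):

```r
# Sketch: train the final model with an explicit binary objective
# y_numeric <- as.numeric(y_final == "M")  # 1 for mines, 0 for rocks
# finalModel <- xgboost(data=x_final, label=y_numeric, objective="binary:logistic",
#                       nrounds=200, max_depth=3, eta=0.3, gamma=0,
#                       colsample_bytree=0.6, min_child_weight=1, subsample=1)
```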
#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
if (!muteEmail) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3532ec19}"
proc.time()-startTimeScript
## user system elapsed
## 88.510 1.280 100.183